feat: add docker agent serve chat command (OpenAI-compatible API) #2510

Open

dgageot wants to merge 19 commits into docker:main from dgageot:board/add-docker-agent-serve-chat-command-0b138539

Conversation


dgageot commented Apr 25, 2026

Fixes #2502.

Exposes any docker-agent agent through an OpenAI-compatible HTTP server, so any tool that already speaks the Chat Completions protocol (Open WebUI, the official openai SDKs, ad-hoc curl scripts, etc.) can drive an agent without a custom integration.

Endpoints

Method  Path                   Notes
GET     /v1/models             Lists exposed agents as OpenAI models
POST    /v1/chat/completions   Runs the agent; supports stream: true (SSE) and false

Usage

docker agent serve chat ./agent.yaml                       # localhost:8083
docker agent serve chat ./team.yaml --agent reviewer       # pin one agent
docker agent serve chat agentcatalog/pirate --listen :9090
curl -sS -X POST http://127.0.0.1:8083/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"hi"}]}'

Design

  • The team is loaded once at startup and shared across requests. Each chat completion gets a fresh session and runtime.
  • The session is created with ToolsApproved=true and NonInteractive=true — there is no human in the loop. ElicitationRequestEvent is still explicitly declined to avoid hanging on the runtime's elicitation channel.
  • The model field of the request can pin a specific agent in a multi-agent team. If it doesn't match an exposed agent (e.g. clients that hard-code gpt-4), we silently fall back to the default agent and echo the requested model name back, so clients that match on the model field stay happy.
  • Streaming uses SSE in OpenAI's chat.completion.chunk format and ends with data: [DONE].
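
As a concrete illustration of the streaming contract (a sketch against the wire format above, assuming the default 127.0.0.1:8083 listen address — not code from this PR):

    package main

    import (
        "bufio"
        "fmt"
        "net/http"
        "strings"
    )

    func main() {
        body := strings.NewReader(`{"stream":true,"messages":[{"role":"user","content":"hi"}]}`)
        resp, err := http.Post("http://127.0.0.1:8083/v1/chat/completions", "application/json", body)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        scanner := bufio.NewScanner(resp.Body)
        for scanner.Scan() {
            data, ok := strings.CutPrefix(scanner.Text(), "data: ")
            if !ok {
                continue // skip the blank separator lines between events
            }
            if data == "[DONE]" { // the stream always ends with this sentinel
                break
            }
            fmt.Println(data) // each payload is a chat.completion.chunk JSON object
        }
    }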

Implementation

  • New cobra command cmd/root/chat.go (default 127.0.0.1:8083, --agent / --listen flags) wired into cmd/root/serve.go.
  • New pkg/chatserver package, split across:
    • server.go — Run, router, HTTP handlers, sseStream, error envelope
    • agent.go — agentPolicy, buildSession, runAgentLoop, sessionUsage
    • types.go — request/response shapes
  • Reuses openai.Model from github.com/openai/openai-go/v3 for /v1/models. Other SDK response types serialise too noisily with stdlib encoding/json (the SDK relies on its internal apijson package, which lives under internal/), so the chat-completion shapes are hand-rolled for clean output.
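
For illustration, a hand-rolled shape can stay this small (abridged sketch; the real definitions in types.go carry more fields, usage and streaming deltas included):

    // Abridged sketch: plain structs like these serialise cleanly with
    // stdlib encoding/json, which is the point of hand-rolling them.
    type chatCompletionResponse struct {
        ID      string   `json:"id"`
        Object  string   `json:"object"` // "chat.completion"
        Created int64    `json:"created"`
        Model   string   `json:"model"`
        Choices []choice `json:"choices"`
    }

    type choice struct {
        Index        int     `json:"index"`
        Message      message `json:"message"`
        FinishReason string  `json:"finish_reason"`
    }

    type message struct {
        Role    string `json:"role"`
        Content string `json:"content"`
    }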

Tests

  • Unit tests for session-building, agent-policy resolution, usage extraction.
  • HTTP-level tests via httptest for /v1/models shape, the three early-validation paths of /v1/chat/completions (bad JSON, empty messages, history without user), and writeError's status→type mapping.

Validation

  • mise lint — 0 issues
  • mise test — all packages green
  • Manual curl smoke test against examples/42.yaml: /v1/models returns the agent, error paths return correct OpenAI-shaped envelopes.

dgageot requested a review from a team as a code owner on April 25, 2026 at 18:00
dgageot force-pushed the board/add-docker-agent-serve-chat-command-0b138539 branch from 40d4ebc to 8711dac on April 25, 2026 at 20:25
trungutt previously approved these changes Apr 26, 2026
dgageot marked this pull request as draft on April 26, 2026 at 10:37
dgageot added a commit to dgageot/cagent that referenced this pull request Apr 27, 2026
dgageot force-pushed the board/add-docker-agent-serve-chat-command-0b138539 branch from 8711dac to 594ad2b on April 27, 2026 at 12:40

dgageot commented Apr 27, 2026

Update — expanded scope

I force-pushed this branch with 18 additional commits on top of the original feat: add docker agent serve chat command. Many of them implement the "easy wins" from the original PR's "Limitations / future work" section, plus a working Go example and a couple of bug fixes uncovered while reviewing the new code.

New commits (oldest → newest)

Example

  • examples: add minimal chat client for docker agent serve chat — examples/chat/main.go runs an interactive REPL against the server using the official github.com/openai/openai-go SDK, demonstrating the OpenAI compatibility end-to-end.

Hardening (trivial / opt-in, safe defaults)

  • chatserver: replace * CORS with --cors-origin flag — wildcard removed; CORS off by default.
  • chatserver: enforce max body size and per-request timeout — --max-request-size (1 MiB default) and --request-timeout (5 min default).
  • chatserver: collect every runtime ErrorEvent (errors.Join) — no more swallowed errors after the first.
  • chatserver: emit structured error events on streaming failures — proper finish_reason: "error" + error envelope, instead of [error: …] jammed into delta.content.
  • chatserver: parse and validate OpenAI sampling parameters — temperature, top_p, max_tokens, stop (string-or-array union) are now declared, range-checked, and rejected with 400 on bad input.

Auth & deployment

  • chatserver: add Bearer-token auth (--api-key) — opt-in static bearer token (constant-time compare; OPTIONS preflight + /openapi.json exempted).
  • chatserver: support comma-list and regex in --cors-origin — allow-list, ~regex patterns, scheme/path validation.

Performance

  • chatserver: support X-Conversation-Id for stateful sessions — opt-in LRU + TTL cache so clients don't have to resend the full history every turn (--conversations-max, --conversation-ttl); see the usage sketch after this list.
  • chatserver: pool runtimes per agent for warm reuse — small pool of warm runtimes per agent to avoid the per-request cost of runtime.New (--max-idle-runtimes).
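
The usage sketch referenced above (hypothetical client code; the conversation id is any stable string the client picks):

    import (
        "context"
        "net/http"
        "strings"
    )

    // With a stable X-Conversation-Id, each request body carries only the
    // newest user message; the server keeps the prior turns. The id
    // "my-session-42" is made up.
    func nextTurn(ctx context.Context) (*http.Response, error) {
        req, err := http.NewRequestWithContext(ctx, http.MethodPost,
            "http://127.0.0.1:8083/v1/chat/completions",
            strings.NewReader(`{"messages":[{"role":"user","content":"and then?"}]}`))
        if err != nil {
            return nil, err
        }
        req.Header.Set("Content-Type", "application/json")
        req.Header.Set("X-Conversation-Id", "my-session-42")
        return http.DefaultClient.Do(req)
    }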

Protocol surface

  • chatserver: surface agent tool calls as OpenAI tool_calls — agent-invoked tools are emitted in OpenAI's tool_calls shape on both streaming and non-streaming responses (informational; tools still execute server-side).
  • chatserver: serve /openapi.json for schema introspection — embedded OpenAPI 3.1 document; bypasses bearer auth.
  • chatserver: accept OpenAI multimodal content (text + image_url) — content accepts the union of string and typed-parts arrays; image parts now reach the runtime via chat.MultiContent.

Bug fixes (found by review pass)

  • fix(chatserver): always store conversation after request — maybeStoreConversation used to skip Put for existing conversations, so if the entry was evicted by another request mid-flight, the updated session was lost.
  • refactor(chatserver): remove unused isNew parameter (follow-up).
  • test(chatserver): add test for conversation restore after eviction (follow-up).
  • chatserver: restore doc comment on chatCompletion — fixes a doc-comment regression introduced by the eviction fix.
  • chatserver: serialize requests sharing an X-Conversation-Id — concurrent requests sharing a conversation id used to share the same *session.Session and race on it. We now reject the second concurrent request with 409 Conflict, surfacing the misuse instead of producing a garbled transcript. Race-detector clean.

CI

  • go build ./... — clean.
  • go test -race ./pkg/chatserver/... ./cmd/root/... — clean.
  • golangci-lint run ./pkg/chatserver/... ./cmd/root/... ./examples/chat/... — 0 issues.

Breaking changes

None on the wire (the API only gains new optional features). For programmatic Go callers of chatserver.Run, the signature has changed from
Run(ctx, agentFilename, agentName, runConfig, ln) to Run(ctx, agentFilename, opts Options, ln); nothing outside this PR uses it.

Happy to split this into separate PRs if reviewers prefer; commit messages are written to be cherry-pickable.

Assisted-By: docker-agent

dgageot added 12 commits April 27, 2026 15:46
Expose any docker-agent agent through an OpenAI-compatible HTTP
server, so tools that already speak the Chat Completions protocol
(Open WebUI, the official `openai` SDKs, ad-hoc curl scripts, etc.)
can drive an agent without any custom integration.

Endpoints:
  GET  /v1/models             — lists exposed agents as OpenAI models
  POST /v1/chat/completions   — runs the agent; supports stream: true
                                (Server-Sent Events) and false

The team is loaded once at startup and shared across requests; each
chat completion gets a fresh session and runtime. Tool calls and
elicitation prompts are auto-handled (this is a non-interactive
endpoint). The `model` field can pin a specific agent in a multi-
agent team, or is ignored and the team's default agent runs.

Implementation notes:

- New `cmd/root/chat.go` cobra command (default 127.0.0.1:8083,
  --agent / --listen flags) wired into `cmd/root/serve.go`.
- New `pkg/chatserver` package, split into:
  - server.go — Run, router, HTTP handlers, sseStream, errors
  - agent.go  — agentPolicy, buildSession, runAgentLoop, sessionUsage
  - types.go  — request/response shapes
- Reuses `openai.Model` from github.com/openai/openai-go/v3 for
  /v1/models. Other OpenAI SDK response types serialise too noisily
  with stdlib `encoding/json` (the SDK relies on its internal
  `apijson` package which we can't import), so request/response
  shapes are hand-rolled for clean output.
- Defensive event handling in runAgentLoop: ToolsApproved=true and
  NonInteractive=true mean the runtime never blocks for confirmation
  in normal flow, but ElicitationRequestEvent must still be answered
  or the runtime would hang on its dedicated channel.

Tests cover session-building, agent-policy, error-envelope shape,
and the three early-validation paths of /v1/chat/completions via
httptest. Validated with `mise lint` (0 issues), `mise test` (all
packages green), and a curl smoke test against examples/42.yaml.

Fixes docker#2502

Assisted-By: docker-agent
Demonstrates the OpenAI-compatible HTTP server introduced in PR docker#2510.

Uses the official github.com/openai/openai-go SDK pointed at the local
chat server's /v1 base URL and runs an interactive REPL with streaming,
history retention, and graceful Ctrl-C shutdown.

Run `docker agent serve chat ./agent.yaml` in one terminal, then
`go run ./examples/chat` in another.

Assisted-By: docker-agent
The chat server used to set `Access-Control-Allow-Origin: *` on every
response, which makes it unsafe to expose on anything other than
loopback. Replace the wildcard with an explicit per-server allow-list
of one origin and disable the CORS middleware entirely when the flag
is empty.

- Introduce `chatserver.Options` so future improvements can extend the
  server configuration without breaking the `Run` signature on each
  change.
- Add `--cors-origin` flag to `docker agent serve chat`. Default empty
  = no CORS headers emitted.
- Update tests; fix three pre-existing `noctx` lint failures in
  handlers_test.go that surfaced when the PR was rebased onto current
  main.

Assisted-By: docker-agent
Hostile or buggy clients could previously stream gigabytes into the
chat completions endpoint or hold a goroutine open indefinitely on a
slow upstream model. Cap both via Echo middleware:

- `BodyLimit` defaults to 1 MiB (configurable via
  `--max-request-size`). Oversized bodies now return 413 instead of
  being silently buffered.
- A new `requestTimeoutMiddleware` wraps `c.Request().Context()` in
  `context.WithTimeout` so model + tool calls + SSE streaming all
  share a single deadline. Default 5 minutes, configurable via
  `--request-timeout`.

Both limits are exposed on `chatserver.Options` (`MaxRequestBytes`,
`RequestTimeout`); zero values fall back to package defaults.

Tests cover oversized body rejection and deadline propagation through
the middleware chain.
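
For illustration, the timeout half can be as small as this sketch
(requestTimeout is a stand-in name; the real middleware may differ in
detail):

    import (
        "context"
        "time"

        "github.com/labstack/echo/v4"
    )

    func requestTimeout(d time.Duration) echo.MiddlewareFunc {
        return func(next echo.HandlerFunc) echo.HandlerFunc {
            return func(c echo.Context) error {
                // One deadline shared by model calls, tool calls and SSE writes.
                ctx, cancel := context.WithTimeout(c.Request().Context(), d)
                defer cancel()
                c.SetRequest(c.Request().WithContext(ctx))
                return next(c)
            }
        }
    }

The body cap is Echo's stock middleware, registered alongside it:
`e.Use(middleware.BodyLimit("1M"), requestTimeout(5*time.Minute))`.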

Assisted-By: docker-agent
Previously runAgentLoop would record only the first ErrorEvent and
drop every subsequent one on the floor while still draining the
stream. That made debugging a multi-error run frustrating: only the
earliest symptom was ever surfaced, even though later events often
held the actual root cause (a model timeout followed by a tool call
that couldn't connect, for instance).

Switch to a slice of errors and join them with `errors.Join` at the
end. The handler's behaviour for callers is unchanged when a single
error occurs; multi-error runs now surface a wrapped error whose
`Unwrap() []error` makes each cause inspectable.
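
Sketched, with a stand-in event type (the real runtime API differs):

    import "errors"

    type errorEvent struct{ Err error } // stand-in for the runtime's ErrorEvent

    // drain records every error instead of stopping at the first one.
    func drain(events <-chan any) error {
        var errs []error
        for event := range events {
            if ev, ok := event.(errorEvent); ok {
                errs = append(errs, ev.Err) // record, keep draining
                continue
            }
            // ... handle content / tool-call events ...
        }
        return errors.Join(errs...) // nil when empty; Unwrap() []error otherwise
    }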

Assisted-By: docker-agent
Until now a runtime error mid-stream was injected into the assistant
content as `[error: ...]` and the stream still closed with
`finish_reason: "stop"`. Clients matching on the OpenAI protocol had
no programmatic way to tell a successful completion apart from a
failed one.

Switch to OpenAI's actual on-the-wire shape: emit a separate
`data: {"error": {...}}` envelope, then terminate the stream with
`finish_reason: "error"` before the `[DONE]` sentinel. Successful
runs continue to terminate with `finish_reason: "stop"`.

Add a unit test on the new `sseStream.sendError` covering the wire
format.
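
An error frame writer in this shape might look like the following
sketch (not the real sseStream.sendError):

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    func sendError(w http.ResponseWriter, msg string) {
        payload, _ := json.Marshal(map[string]any{
            "error": map[string]string{"message": msg, "type": "server_error"},
        })
        fmt.Fprintf(w, "data: %s\n\n", payload)
        if f, ok := w.(http.Flusher); ok {
            f.Flush() // push the frame before finish_reason "error" and [DONE]
        }
    }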

Assisted-By: docker-agent
OpenAI clients regularly send `temperature`, `top_p`, `max_tokens`,
and `stop` on every chat completion request. The server used to drop
them silently because the request struct didn't declare them, so
typos and out-of-range values went unnoticed until the upstream
provider eventually returned an opaque error several seconds later.

- Add `Temperature`, `TopP`, `MaxTokens`, `Stop` to
  `ChatCompletionRequest` so the OpenAPI schema matches what the
  wire protocol allows.
- `Stop` is JSON-flexible: clients send either a single string or an
  array, and OpenAI accepts both. Custom `UnmarshalJSON` handles the
  union shape.
- `validateSamplingParams` range-checks the new fields and rejects
  bad input with a 400 invalid_request_error, matching how OpenAI
  itself behaves.

Plumbing these values through the runtime to the model layer
requires per-request overrides that don't exist today; that work is
tracked separately. Validating up front is the user-visible win and
unblocks future plumbing.
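
The stop union unmarshaller is small; a sketch (naming in the real
types.go may differ):

    import (
        "encoding/json"
        "fmt"
    )

    type stopSequences []string

    func (s *stopSequences) UnmarshalJSON(data []byte) error {
        var one string
        if err := json.Unmarshal(data, &one); err == nil {
            *s = stopSequences{one} // "stop": "END"
            return nil
        }
        var many []string
        if err := json.Unmarshal(data, &many); err != nil {
            return fmt.Errorf(`"stop" must be a string or array of strings: %w`, err)
        }
        *s = many // "stop": ["END", "STOP"]
        return nil
    }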

Assisted-By: docker-agent
The chat server is unauthenticated by default, which is fine on
loopback but unsafe anywhere else. Add an opt-in static bearer-token
gate so the server can be safely bound to a LAN interface.

- `chatserver.Options.APIKey`: when non-empty, every request to /v1/*
  must carry `Authorization: Bearer <token>` or it is rejected with
  401. Empty preserves the previous unauthenticated behaviour.
- `bearerAuthMiddleware` uses `subtle.ConstantTimeCompare` to dodge
  timing-side-channel leaks. CORS preflight (OPTIONS) is exempted so
  browsers can negotiate before sending the auth header.
- `--api-key` and `--api-key-env` flags expose the option from the
  CLI; the env-var form keeps secrets out of process listings.

Tests cover missing/wrong/correct tokens and the OPTIONS exemption.
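
The core check, sketched (assuming the Options.APIKey wiring above):

    import (
        "crypto/subtle"
        "net/http"
        "strings"
    )

    func bearerOK(r *http.Request, apiKey string) bool {
        token, ok := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
        if !ok {
            return false
        }
        // ConstantTimeCompare is only constant-time for equal lengths, so
        // the length check leaks nothing beyond the key's length.
        return len(token) == len(apiKey) &&
            subtle.ConstantTimeCompare([]byte(token), []byte(apiKey)) == 1
    }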

Assisted-By: docker-agent
Until now the server was strictly stateless: every chat completion
request rebuilt a fresh session from the messages array, so clients
paid the tokenization cost of replaying the full history on every
turn. That works but is wasteful for long conversations.

Add an opt-in conversation cache:

- `chatserver.Options.ConversationsMaxSessions` enables an in-memory
  LRU keyed by the `X-Conversation-Id` request header.
  `Options.ConversationTTL` (default 30 min) bounds idle lifetime;
  expired entries are evicted lazily on access and on Put.
- When a request carries a known id, the server reuses the existing
  session and only appends the latest user message from the request
  body. The session already has the prior turns. When the id is
  unknown (or the header is absent), the server falls back to the
  previous behaviour and builds a session from scratch.
- New `--conversations-max` and `--conversation-ttl` CLI flags
  expose the feature. Default 0 keeps the old stateless behaviour.

The cache implementation is a simple map + mutex with O(n) LRU
scan; that's appropriate for the small caches typical for this
feature, and avoids pulling in a new dependency.

Tests cover Put/Get, TTL expiry, LRU eviction, Delete, and the new
appendLatestUser helper.
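
A minimal sketch of the store's Put path (the real code also expires
TTL-stale entries and stores the project's *session.Session):

    import (
        "sync"
        "time"
    )

    type entry struct {
        sess     any // *session.Session in the real code
        lastUsed time.Time
    }

    type conversations struct {
        mu  sync.Mutex
        max int
        m   map[string]*entry
    }

    func newConversations(max int) *conversations {
        return &conversations{max: max, m: map[string]*entry{}}
    }

    func (c *conversations) Put(id string, sess any) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if _, exists := c.m[id]; !exists && len(c.m) >= c.max {
            var oldest string // O(n) LRU scan: fine for small caches
            for k, e := range c.m {
                if oldest == "" || e.lastUsed.Before(c.m[oldest].lastUsed) {
                    oldest = k
                }
            }
            delete(c.m, oldest)
        }
        c.m[id] = &entry{sess: sess, lastUsed: time.Now()}
    }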

Assisted-By: docker-agent
Every chat completion request used to call `runtime.New` from
scratch: that resolves the agent's tools, builds per-agent hook
executors, and allocates per-runtime resume/elicitation channels.
On a busy server those allocations show up in profiles.

Add an opt-in pool so a small number of warm runtimes per agent can
be reused across requests:

- `chatserver.Options.MaxIdleRuntimes` (default 4 via the
  `--max-idle-runtimes` flag) bounds the idle pool size per agent.
  0 disables pooling entirely and restores the original "fresh
  runtime per request" behaviour.
- `runtimePool.Get` returns a recycled runtime when one is idle, or
  creates a new one. `Put` returns it to the pool on completion;
  overflow is dropped on the floor (the team owns the toolsets, so
  nothing leaks).
- A runtime is *not* safe for concurrent `RunStream` calls (its
  resume/elicitation channels are per-runtime), so the pool hands
  out at most one borrow per runtime at a time. Concurrency comes
  from holding multiple runtimes per agent.
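
One plausible shape, sketched with a buffered channel per agent (the
Runtime type and constructor are stand-ins):

    type Runtime struct{ /* stands in for the real runtime type */ }

    type runtimePool struct {
        idle  chan *Runtime // capacity = MaxIdleRuntimes
        newRT func() (*Runtime, error)
    }

    // A runtime is either in the channel (idle) or held by exactly one
    // request, which gives the at-most-one-borrow guarantee for free.
    func (p *runtimePool) Get() (*Runtime, error) {
        select {
        case rt := <-p.idle: // warm runtime: skip the runtime.New cost
            return rt, nil
        default:
            return p.newRT()
        }
    }

    func (p *runtimePool) Put(rt *Runtime) {
        select {
        case p.idle <- rt:
        default: // pool full: drop the overflow, nothing leaks
        }
    }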

Assisted-By: docker-agent
The previous commit only accepted a single literal origin. Real
deployments often need to allow several front-ends or all subdomains
of a known SaaS. Extend the flag's grammar:

- comma-separated entries form an explicit allow-list, each matched
  exactly;
- entries prefixed with `~` are compiled as Go regex and matched
  against the request's `Origin` header at request time;
- the literal `*` wildcard is preserved for the (rare) cases where
  the operator really wants it;
- literal entries are validated up front: scheme must be http/https,
  no path/query/fragment, no missing host. Mistakes are caught at
  startup rather than producing silent allow-none behaviour at
  runtime.

When the spec parses cleanly to nothing usable, the middleware is
left unregistered and a slog.Error documents the misconfiguration.

Tests cover the parser's accept/reject set and exercise allow-list +
regex routing through the real Echo middleware.
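
The grammar, sketched (literal-origin validation elided):

    import (
        "fmt"
        "regexp"
        "strings"
    )

    func parseCORSSpec(spec string) (literals []string, patterns []*regexp.Regexp, err error) {
        for _, raw := range strings.Split(spec, ",") {
            entry := strings.TrimSpace(raw)
            switch {
            case entry == "":
                continue
            case strings.HasPrefix(entry, "~"): // e.g. ~^https://.*\.example\.com$
                re, err := regexp.Compile(strings.TrimPrefix(entry, "~"))
                if err != nil {
                    return nil, nil, fmt.Errorf("bad origin pattern %q: %w", entry, err)
                }
                patterns = append(patterns, re)
            default: // literal origin or the explicit "*" wildcard
                literals = append(literals, entry)
            }
        }
        return literals, patterns, nil
    }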

Assisted-By: docker-agent
When the agent invokes a tool, clients had no way to see what
happened: tools ran inside the runtime, the assistant's eventual
text output sometimes referenced them but often didn't, and the
streaming protocol carried only the model's plain content. That's
fine for a black-box transcript but useless for a chat UI that
wants to render "🔧 calling search(query=…)" badges.

Use OpenAI's standard `tool_calls` shape on both response styles:

- Add `ToolCallReference` (mirrors OpenAI's tool_call entry) with
  `index`, `id`, `type`, `function.{name,arguments}`.
- `ChatCompletionMessage.ToolCalls` populated on the non-streaming
  response so the assistant message lists every tool the agent
  invoked.
- `ChatCompletionStreamDelta.ToolCalls` carries one tool per delta
  in streaming mode. The runtime hands us complete arguments, so
  one chunk per call is sufficient (vs. OpenAI's incremental
  argument streaming, which clients accumulate either way).
- `runAgentLoop` now takes an `agentEmit` struct with
  `onContent` and `onToolCall` hooks instead of a single content
  callback. Both handlers fill in their respective hooks; missing
  ones are no-ops.

Tools still execute server-side; this commit is purely about
client observability. Surfacing results back through the protocol
(so clients could intercept / replay them) is left for a future
change.
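
The wire shape, sketched as Go structs (exact field names in types.go
are assumptions):

    type ToolCallReference struct {
        Index    int          `json:"index"`
        ID       string       `json:"id"`
        Type     string       `json:"type"` // always "function"
        Function toolFunction `json:"function"`
    }

    type toolFunction struct {
        Name      string `json:"name"`
        Arguments string `json:"arguments"` // complete JSON: one chunk per call
    }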

Assisted-By: docker-agent
dgageot force-pushed the board/add-docker-agent-serve-chat-command-0b138539 branch from 594ad2b to 7f16975 on April 27, 2026 at 13:48
dgageot added 7 commits April 27, 2026 15:56
Add a static OpenAPI 3.1 document describing /v1/models,
/v1/chat/completions, the new tool_calls fields, the
X-Conversation-Id header, and the bearer-auth security scheme.

- The spec is hand-written and embedded with `//go:embed`. That
  keeps it easy to review (it's plain JSON, not generated noise),
  trivial to update when the API changes, and free of generation
  steps in the build.
- A new `GET /openapi.json` route serves the spec verbatim.
- `bearerAuthMiddleware` exempts /openapi.json so introspection
  tooling can discover the API even on locked-down deployments —
  there's no secret in the spec, only the shape of the API.

Tests cover both the document shape (correct paths advertised) and
the auth bypass.
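
The wiring is essentially this sketch (handler details are
assumptions):

    import (
        _ "embed"
        "net/http"

        "github.com/labstack/echo/v4"
    )

    //go:embed openapi.json
    var openAPISpec []byte

    func registerOpenAPI(e *echo.Echo) {
        e.GET("/openapi.json", func(c echo.Context) error {
            return c.JSONBlob(http.StatusOK, openAPISpec) // served verbatim
        })
    }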

Assisted-By: docker-agent
OpenAI's chat protocol lets the `content` field of a message be
either a string or an array of typed parts:

    "content": [
      {"type": "text", "text": "What is in this picture?"},
      {"type": "image_url", "image_url": {"url": "..."}}
    ]

The chat server used to drop the parts variant on the floor: the
field was typed as `string`, so multi-part requests deserialised
to an empty content and the request was rejected as having "no
user message". That made the server unable to serve any
vision-capable agent.

- Replace the plain `Content string` with a JSON-union
  (un)marshaller. `Content` still carries a flat-text view for
  string-form content and for the concatenated text of parts; a
  new `Parts []ContentPart` field holds the typed entries when the
  array shape is used. Existing Go callers (and every test that
  still writes `Content: "..."`) keep working unchanged.
- `convertParts` translates the wire shape to the runtime's
  `chat.MessagePart` union (text + image_url), so the model
  provider sees the actual image. Unknown part types are dropped
  gracefully so future part kinds degrade rather than 500.
- `appendLatestUser` (used by X-Conversation-Id continuation) gets
  the same multi-part path.
- The OpenAPI spec advertises the union shape and the new
  ContentPart schema.

Tests cover string/array round-trips, image_url plumbing into the
session, and (still passing) all the pre-existing behaviour.
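
A sketch of the union unmarshaller (details such as the text-join
separator are assumptions):

    import (
        "encoding/json"
        "strings"
    )

    type ContentPart struct {
        Type     string `json:"type"` // "text" or "image_url"
        Text     string `json:"text,omitempty"`
        ImageURL *struct {
            URL string `json:"url"`
        } `json:"image_url,omitempty"`
    }

    type messageContent struct {
        Content string        // flat-text view, always populated
        Parts   []ContentPart // typed entries when the array shape is used
    }

    func (m *messageContent) UnmarshalJSON(data []byte) error {
        var s string
        if err := json.Unmarshal(data, &s); err == nil { // plain-string form
            m.Content = s
            return nil
        }
        if err := json.Unmarshal(data, &m.Parts); err != nil { // array of parts
            return err
        }
        var texts []string
        for _, p := range m.Parts {
            if p.Type == "text" {
                texts = append(texts, p.Text)
            }
        }
        m.Content = strings.Join(texts, "\n")
        return nil
    }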

Assisted-By: docker-agent
When a conversation was evicted from the LRU cache while a request was
still processing it, the updated session was not stored back, because
maybeStoreConversation only called Put when isNew=true.

This caused conversation state to be lost when:
1. Request R1 retrieves conversation C from cache (isNew=false)
2. R1 processes the request, updating the session
3. Meanwhile, C is evicted due to LRU policy
4. R1 finishes and calls maybeStoreConversation(C, sess, false)
5. Since isNew=false, Put was not called
6. The updated session is lost

Fix: Always call Put, regardless of isNew flag. This ensures the
updated session is stored and refreshes the lastUsed timestamp,
preventing premature eviction of active conversations.

The Put operation is idempotent and safe to call multiple times for
the same conversation ID.

Assisted-By: docker-agent
The isNew flag was used to decide whether to call Put on the
conversation store, but after the previous fix, we always call Put
regardless of whether the conversation is new or existing.

This commit removes the now-unused isNew parameter from
resolveSession and maybeStoreConversation, simplifying the code.

Assisted-By: docker-agent
Add a test that verifies a conversation evicted from the LRU cache
while a request is processing it can still be stored back after the
request completes.

This test validates the fix in commit 9563a43 which ensures
maybeStoreConversation always calls Put, preventing loss of session
state when a conversation is evicted during request processing.

Assisted-By: docker-agent
The previous fix accidentally deleted the doc-comment header line
on `(*server).chatCompletion`, leaving a dangling fragment
("// non-streaming OpenAI ChatCompletion object.") detached from
the function it documents.

Assisted-By: docker-agent
Concurrent requests with the same X-Conversation-Id share the same
`*session.Session` pointer (the conversation cache hands out the
same instance to every caller), so two simultaneous runtime
RunStream calls would interleave message appends, send overlapping
prompts to the model, and produce a garbled transcript.

Although `session.Session` has internal mutex protection on
Messages, the agent loop reads-then-writes (decide what to send,
append model output) so per-field synchronisation isn't enough —
the whole turn must be atomic with respect to other turns on the
same id.

Reject the second concurrent request with 409 Conflict instead of
trying to serialise it on the server. That:

- Surfaces the misuse to the caller immediately (vs. mysterious
  interleaving),
- Keeps server-side resources bounded (no queue, no parked
  goroutines),
- Matches how OpenAI's own conversation API expects clients to
  use the protocol (one request at a time per conversation).

Empty conversation id and nil lock-set are no-ops, so callers
without the feature enabled keep their old behaviour.

The OpenAPI spec advertises the new 409 response. Tests cover
acquire/release semantics, nil/empty no-ops, and a race-detector-
friendly stress test that proves at most one holder of the same
id at a time.
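
A per-id try-lock set along these lines (names are assumptions):

    import "sync"

    type idLocks struct {
        mu   sync.Mutex
        held map[string]struct{}
    }

    func newIDLocks() *idLocks { return &idLocks{held: map[string]struct{}{}} }

    // tryAcquire reports whether id was free; nil receiver and empty id
    // are no-ops so disabled deployments keep their old behaviour.
    func (l *idLocks) tryAcquire(id string) bool {
        if l == nil || id == "" {
            return true
        }
        l.mu.Lock()
        defer l.mu.Unlock()
        if _, busy := l.held[id]; busy {
            return false // handler answers 409 Conflict
        }
        l.held[id] = struct{}{}
        return true
    }

    func (l *idLocks) release(id string) {
        if l == nil || id == "" {
            return
        }
        l.mu.Lock()
        defer l.mu.Unlock()
        delete(l.held, id)
    }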

Assisted-By: docker-agent
dgageot force-pushed the board/add-docker-agent-serve-chat-command-0b138539 branch from 7f16975 to 0e1c5a8 on April 27, 2026 at 13:58
dgageot marked this pull request as ready for review on April 27, 2026 at 14:01